Statistical Machine Translation between Related Languages

نویسندگان

  • Pushpak Bhattacharyya
  • Mitesh M. Khapra
  • Anoop Kunchukuttan
چکیده

Language­independent Statistical Machine Translation (SMT) has proven to be very challenging. The diversity of languages makes high accuracy difficult and requires substantial parallel corpus as well as linguistic resources (parsers, morph analyzers, etc.). An interesting observation is that a large chunk of machine translation (MT) requirements involve related languages. They are either : (i) between related languages, or (ii) between a lingua franca (like English) and a set of related languages. For instance, India, the European Union and South­East Asia have such translation requirements due to government, business and socio­cultural communication needs. Related languages share a lot of linguistic features and the divergences among them are at a lower level of the NLP pipeline. The objective of the tutorial is to discuss how the relatedness among languages can be leveraged to bridge this language divergence thereby achieving some/all of these goals: (i) improving translation quality, (ii) achieving better generalization, (iii) sharing linguistic resources, and (iv) reducing resource requirements. We will look at the existing research in SMT from the perspective of related languages, with the goal to build a toolbox of methods that are useful for translation between related languages. This tutorial would be relevant to Machine Translation researchers and developers, especially those interested in translation between low­resource languages which have resource­rich related languages. It will also be relevant for researchers interested in multilingual computation. We start with a motivation for looking at the SMT problem from the perspective of related languages. We introduce notions of language relatedness useful for MT. We explore how lexical, morphological and syntactic similarity among related languages can help MT. Lexical similarity will receive special attention since related languages share a significant vocabulary in terms of cognates, loanwords, etc. Then, we look beyond bilingual MT and present how pivot­based and multi­source methods incorporate knowledge from multiple languages, and handle language pairs lacking parallel corpora. We present some studies concerning the implications of languages relatedness to pivot­based SMT, and ways of handling language divergence in the pivot­based SMT scenario. Recent advances in deep learning have made it possible to train multi­language neural MT systems, which we think would be relevant to training between related languages.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language

Machine Translation Evaluation Metrics (MTEMs) are the central core of Machine Translation (MT) engines as they are developed based on frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages is still under question. The aim of this research study was to examine the validity and assess the quality of MTEMs from Lexical Similarity set on machine tra...

متن کامل

Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

We propose several techniques for improving statistical machine translation between closely-related languages with scarce resources. We use character-level translation trained on n-gram-character-aligned bitexts and tuned using word-level BLEU, which we further augment with character-based transliteration at the word level and combine with a word-level translation model. The evaluation on Maced...

متن کامل

Language Related Issues for Machine Translation between Closely Related South Slavic Languages

Machine translation between closely related languages is less challenging and exhibits a smaller number of translation errors than translation between distant languages, but there are still obstacles which should be addressed in order to improve such systems. This work explores the obstacles for machine translation systems between closely related South Slavic languages, namely Croatian, Serbian...

متن کامل

Statistical Machine Translation Between Related and Unrelated Languages

In this paper we describe an attempt to compare how relatedness of languages can influence the performance of statistical machine translation (SMT). We apply the Moses toolkit on the Czech-English-Russian corpus UMC 0.1 in order to train two translation systems: Russian-Czech and English-Czech. The quality of the translation is evaluated on an independent test set of 1000 sentences parallel in ...

متن کامل

A Statistical Machine Translation Approach to Sinhala-Tamil Language Translation

Data-driven approaches to Machine Translation have come to the fore of Language Processing Research over the past decade. The relative success in terms of robustness of Example Based and Statistical approaches have given rise to a new optimism and an exploration of other data-driven approaches such as Maximum Entropy language modeling. Much of the work in the literature however, largely report ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016